Lesson 2 todo
Today we are going to start summarising the variables in our data, so that we can talk about them in a meaningful way and begin to tell a story with our data. Consider this parliamentary research briefing on UK prison population statistics. It looks at the number and make-up of people in prison in the UK. To do this, it utilises data about people in prison, which you can imagine, based on our experience with data last week, as a spreadsheet with each row representing one individual prisoner. You can also imagine some columns that contain values that correspond to each prisoner, representing a set of variables recorded about him or her. But it would not be very informative to just print out this spreadsheet and hand it to you - and definitely not to hand it to policy makers, who are busy and most likely looking for a summary of headline figures rather than rows and rows of data. If you did click on the link, you can see that it instead summarises the data in a way that people can read through and draw meaningful conclusions from.
By reading this report, you can come to know that, at 31st March 2017, the total prison population in England and Wales was just over 85,500. But going further, one of the variables in the data set is the person’s gender. If we want to talk about this one variable - gender - in this data set - prison population - we can turn to univariate analysis of this variable. For example, we could count the number of men versus the number of women in prison. What do you think this will tell us? Do you think there will be an equal number of men and women? If you’ve been paying attention in some of your other courses, you’ll likely suspect that there are some gender differences in the prison population. So what do you think the split is like? Do you think it’s 60-40? 70-30? 80-20?
Come on, take a guess, I’ll hold off telling you. Speak to someone next to you. Discuss why you think it’s the split that you think it is. I’ll wait here.
Ready? OK, I can tell you now. Actually, according to the count of the prison population in September 2017, the number of men in prison in England and Wales is 82,312, while the number of women is 3,982, making the split about 95-5 (see for yourself here). Are you surprised? I definitely was! I had no idea the difference was this large! You can often gain valuable insight into topics that you are interested in by looking into one variable - that is, by performing univariate analysis on your data. And this is what we will learn to do today. Excited? Yaaay
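If you want to check the 95-5 split yourself, the arithmetic is just each count divided by the total. Here is a minimal sketch in Python (this lab itself uses Excel, so treat this as an optional aside; the variable names are made up for illustration, and the two counts are the September 2017 figures quoted above):

```python
# Counts quoted above for England and Wales, September 2017
men = 82312
women = 3982

# The total is simply the sum of the two categories
total = men + women

# Each category's share of the total, as a percentage
pct_men = 100 * men / total
pct_women = 100 * women / total

print(f"Total: {total}")
print(f"Men: {pct_men:.1f}%  Women: {pct_women:.1f}%")
# prints: Men: 95.4%  Women: 4.6%
```

Rounded to the nearest five, that is the 95-5 split mentioned above.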
So you want to analyse your variable. As you have likely pieced together by now, univariate analysis simply means the analysis of one variable. I am giving you a sneak peek into next week’s session now by telling you that bivariate analysis means that you are looking into the relationship between two variables…! And just you wait until we get to multivariate analysis, which is the analysis of the relationship between more than two variables!!!11!! Mind blown.
Right, now that we are confident with our terminology, let’s think about our univariate analysis options. We know that we want to be able to talk about one variable in our data set. How can we do this?
Well, remember when we spoke about levels of measurement last week? No? Well, that was when we looked at the different variables in terms of whether they were nominal, ordinal, or numeric. If it still doesn’t ring a bell, go back to last week’s lab and Ctrl + F for these terms. But hopefully you will have retained some of this. Remember, we can differentiate between numeric and categorical variables, and then categorical variables we can further sub-divide into nominal and ordinal. Here’s a beautiful and scientific drawing to illustrate:
So why is this important? Well, which level of measurement your variable falls into dictates what types of summaries are appropriate. Thinking back to the gender example, it would not make a huge amount of sense to calculate the “average gender”, would it? Gender is a nominal variable, and as such an appropriate way to summarise it is not to calculate an average (mean or median, but we will get into this a bit later). Instead there are other approaches you could take. For example, you could look into the modal category - which value of the variable occurs the most frequently? In the prison population data above, the modal gender was male, as this was the most frequently occurring value. But we also wanted to look at the frequency of the other values the variable can take, in this case the frequency of the female gender. So we could look at a frequency table. These are valid ways of summarising categorical variables, which may not work for numeric variables, which are better suited to measures like average and variance. But now I’m just saying words.
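To make "modal category" and "frequency table" a bit less abstract, here is a small sketch in Python (again an optional aside, since this lab works in Excel). The list of values is toy data invented for illustration, not real prison data, and the variable names are my own:

```python
from collections import Counter

# Made-up toy data: one value of a nominal variable per observation
genders = ["male", "male", "female", "male", "male", "female", "male"]

# Frequency table: how many times each value occurs
freq = Counter(genders)

# Modal category: the value that occurs most frequently
modal_category, modal_count = freq.most_common(1)[0]

# Print the frequency table with each value's share of all observations
for value, count in freq.most_common():
    share = 100 * count / len(genders)
    print(f"{value:<8}{count:>3}{share:>7.1f}%")

print("Modal category:", modal_category)
```

Notice that nothing here requires the values to be numbers or to have an order: counting works for any categorical variable, which is exactly why it suits nominal data.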
Hopefully these words sound familiar from your reading. But it might be that their meanings are not entirely clear just yet. That’s fine. Let’s do some exercises, and demystify these, so that we can get on to telling some interesting stories with our data!
Let’s start with some exercises in talking about categorical variables. We will do this by looking at crime data from 2016-2017 recorded by Greater Manchester Police, available for download from the police.uk website. You don’t have to go and download this yourself though, because I have put the data on Blackboard for you. So just go to Blackboard, then to the course content, and download the file gmp_crimes.csv into your working directory. Once you have saved it, open the file using Excel.
It should open up like this, with your variable names as column headers:
Under the column headers you have your 420228 rows, one for each of the 420228 crimes in your data. Recall that these are your observations. This also means that your unit of analysis in this data at this moment is each individual crime.
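If you are curious how you might check the number of observations outside Excel, here is a minimal sketch in Python. It runs on a tiny made-up stand-in for the file, and the column names in it are hypothetical (I am not reproducing the real headers of gmp_crimes.csv here); the function name `count_observations` is also just my own label:

```python
import csv
import io

# Tiny made-up stand-in for gmp_crimes.csv; the column names are hypothetical
sample = io.StringIO(
    "crime_id,crime_type\n"
    "1,Burglary\n"
    "2,Robbery\n"
    "3,Burglary\n"
)

def count_observations(file_obj):
    """Return (column names, number of data rows), excluding the header row."""
    reader = csv.DictReader(file_obj)
    n_rows = sum(1 for _ in reader)  # one row per observation (here: per crime)
    return reader.fieldnames, n_rows

columns, n_rows = count_observations(sample)
print(columns)  # the variable names, taken from the header row
print(n_rows)   # the number of observations
```

To try this on the real data, you could pass in the opened file instead of the sample, for example `open("gmp_crimes.csv", newline="")`; if your copy matches the description above, the row count should come out as 420228.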
Anyway, let’s say we want to talk about your variables. Let’s say that you want to talk about the variable crime type. What level of measurement does this variable have? Is it a category? (hint: yes, it’s in the ‘Summarising categorical data’ section, that sort of gives it away…) Does it have a meaningful order? Now this one is one to think about! What do you think? Is crime type a nominal or an ordinal variable? Do you think it has a meaningful order? Take a moment to think about which one you think it is, and most importantly why you think this is the case. Turn to a friend, tell them which one you think it is and also why you think this. Then ask them what they think, as well as their reasoning behind this. Here’s a gif to separate the answer, so you don’t ruin the surprise before you have a chance to discuss.